skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Patel, Bhavesh"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Point cloud is an important type of geometric data structure for many embedded applications such as autonomous driving and augmented reality. Current Point Cloud Networks (PCNs) have proven to achieve great success in using inference to perform point cloud analysis, including object part segmentation, shape classification, and so on. However, point cloud applications on the computing edge require more than just the inference step. They require an end-to-end (E2E) processing of the point cloud workloads: pre-processing of raw data, input preparation, and inference to perform point cloud analysis. Current PCN approaches to support end-to-end processing of point cloud workload cannot meet the real-time latency requirement on the edge, i.e., the ability of the AI service to keep up with the speed of raw data generation by 3D sensors. Latency for end-to-end processing of the point cloud workloads stems from two reasons: memory-intensive down-sampling in the pre-processing phase and the data structuring step for input preparation in the inference phase. In this paper, we present HgPCN, an end-to-end heterogeneous architecture for real-time embedded point cloud applications. In HgPCN, we introduce two novel methodologies based on spatial indexing to address the two identified bottlenecks. In the Pre-processing Engine of HgPCN, an Octree-Indexed-Sampling method is used to optimize the memory-intensive down-sampling bottleneck of the pre-processing phase. In the Inference Engine, HgPCN extends a commercial DLA with a customized Data Structuring Unit which is based on a Voxel-Expanded Gathering method to fundamentally reduce the workload of the data structuring step in the inference phase. The initial prototype of HgPCN has been implemented on an Intel PAC (Xeon+FPGA) platform. Four commonly available point cloud datasets were used for comparison, running on three baseline devices: Intel Xeon W-2255, Nvidia Xavier NX Jetson GPU, and Nvidia 4060ti GPU. These point cloud datasets were also run on two existing PCN accelerators for comparison: PointACC and Mesorasi. Our results show that for the inference phase, depending on the dataset size, HgPCN achieves speedup from 1.3× to 10.2× vs. PointACC, 2.2× to 16.5× vs. Mesorasi, and 6.4× to 21× vs. Jetson NX GPU. Along with optimization of the memory-intensive down-sampling bottleneck in pre-processing phase, the overall latency shows that HgPCN can reach the real-time requirement by providing end-to-end service with keeping up with the raw data generation rate. 
    more » « less
    Free, publicly-accessible full text available November 2, 2025
  2. null (Ed.)
    Large Convolutional Neural Networks (CNNs) are often pruned and compressed to reduce the amount of parameters and memory requirement. However, the resulting irregularity in the sparse data makes it difficult for FPGA accelerators that contains systolic arrays of Multiply-and-Accumulate (MAC) units, such as Intel’s FPGA-based Deep Learning Accelerator (DLA), to achieve their maximum potential. Moreover, FPGAs with low-bandwidth off-chip memory could not satisfy the memory bandwidth requirement for sparse matrix computation. In this paper, we present 1) a sparse matrix packing technique that condenses sparse inputs and filters before feeding them into the systolic array of MAC units in the Intel DLA, and 2) a customization of the Intel DLA which allows the FPGA to efficiently utilize a high bandwidth memory (HBM2) integrated in the same package. For end-to-end inference with randomly pruned ResNet-50/MobileNet CNN models, our experiments demonstrate 2.7x/3x performance improvement compared to an FPGA with DDR4, 2.2x/2.1x speedup against a server-class Intel SkyLake CPU, and comparable performance with 1.7x/2x power efficiency gain as compared to an NVidia V100 GPU. 
    more » « less
  3. SCAIGATE is an ambitious project to design the first AI-centric science gateway based on field-programmable gate arrays (FPGAs). The goal is to democratize access to FPGAs and AI in scientific computing and related applications. When completed, the project will enable the large-scale deployment and use of machine learning models on AI-centric FPGA platforms, allowing increased performance-efficiency, reduced development effort, and customization at unprecedented scale, all while simplifying ease-of-use in science domains which were previously AI-lagging. 
    more » « less
  4. Field-programmable gate arrays (FPGAs) have largely been used in communication and high-performance computing and given the recent advances in big data and emerging trends in cloud computing (e.g., serverless [18]), FPGAs are increasingly being introduced into these domains (e.g., Microsoft’s datacenters [6] and Amazon Web Services [10]). To address these domains’ processing needs, recent research has focused on using FPGAs to accelerate workloads, ranging from analytics and machine learning to databases and network function virtualization. In this paper, we present an ongoing effort to realize a high-performance FPGA-as-a-microservice (FaaM) architecture for the cloud. We discuss some of the technical challenges and propose several solutions for efficiently integrating FPGAs into virtualized environments. Our case study deploying a multithreaded, multi-user compression as a microservice using the FaaM architecture indicate that microservices-based FPGA acceleration can sustain high-performance compared to straightforward implementation with minimal to no communication overhead despite the hardware abstraction. 
    more » « less